ABSTRACT

Profiling under UNIX is done by inserting counters into 
programs either before or during the compilation or assembly 
phases. A fourth type of profiling involves monitoring the 
execution of a program, and gathering relevant statistics dur- 
ing the run. This paper looks at this method and an imple- 
mentation of it, and discusses its advantages and disadvan- 
tages. 


Introduction 

There is a saying among programmers, “[m]ake it right before you 
make it faster” 1 . This involves testing the program, usually by running 
it on some test data. But how can a programmer be sure that the test 
data really exercises all paths of control, so every statement is executed 
at least once? And once the programmer is satisfied his program is 
right, how can he tell in what sections of code the program spends most 
of its time? 

Obtaining the answers to these questions requires the use of a tool
called a profiler. This tool will monitor the execution of a program, 
gather statistics on the program execution, and print the results in an 
understandable form. Among the units of a program which can be profiled
are functions and source lines; one can think of these as being
large-grained and fine-grained units, the idea being that one profiles the
function calls to determine in which function the program spends most of
its time, and then looks at a source line profile of that function to
determine what parts should be rewritten. Several books describe methods for
using this information to improve program performance ’ . 

There are also two types of statistics that are gathered from profil- 
ing; each has its own uses. The first is timings, which give the number 
of seconds (or clock ticks) spent in each unit. These statistics must be 
read with an understanding of factors that corrupt the timings. Since 
instructions are usually executed far faster than one clock tick per instruc- 
tion, timings are rarely exact; for example, if a subroutine is called and 
returns between clock ticks, the subroutine would not show up in timings. 
Timings also depend on things not related to the program, such as the 
speed of paging and what parts need to be paged in. So, while timings 
are a useful guide, they are not ideal. The second statistic is counts,
which give the number of times the relevant unit has been executed.
Counts have the advantage that they are entirely precise; but since the
units being counted may vary wildly in complexity, they lack the
weighting that timings provide.

Timing and counting statistics are both generated in the same way.
Special instructions are placed between the units being monitored, such as
function or block entry points. When the program runs, this special code
increments timers or counters, and when the program ends, the
information is saved somewhere. The programmer can then analyze this
information to see the timings and counts that interest him.

There are three basic ways to implement profiling programs. The
first is to modify the compiler to generate the special code; the second is
to use a preprocessor or postprocessor to insert special code in the source
program or the assembly language produced by the compiler; and the
third is to use an execution monitor. Traditionally, UNIX† profiling has
been done using the first method 4. This method has the disadvantage
that one needs access to the compiler sources to implement it, and
system administrators are as a rule reluctant to replace a working compiler
with a locally modified one. It has the advantage that no preprocessing
or postprocessing is needed to add the instructions, and issues such as
handling the state of the process do not arise since the compiler will deal
with them. Of late, the second method has also been used 5. The
problems with postprocessors are that the postprocessor must preserve
condition codes across the inserted special code, and that, in order to
work correctly, it must have an intimate knowledge of the target
computer’s assembly language. The problems with preprocessors are
different; basically, preprocessors require that the program be parsed and
(where necessary) rewritten to prevent the special code being inserted from
causing syntax errors. Both have the advantage that one need not modify
the compiler to use them, since they are not a part of the compiler itself.


† UNIX is a trademark of Bell Laboratories.

Very little attention has been paid to using execution monitors with
UNIX thus far. This paper will examine the design, implementation, and
experiences using such a tool. First, we shall discuss how an execution
monitor works, and then describe the implementation of this tool, and
some experiences with its use.

How an Execution Monitor Works 

Use of an execution monitor involves a technique called patching.
When the execution monitor runs, it starts the program to be profiled
and immediately suspends it. The monitor then saves instructions at the
beginning of each unit of the program to be profiled, and replaces them
with instructions that will cause a fault when executed. When this is
done, the execution monitor restarts the program to be profiled. Whenever
a unit is reached, a fault occurs, and control is returned to the
execution monitor; the execution monitor determines if the fault was caused
by entry into a unit, and if so increments the counters and timers
associated with that unit. It then puts back the instruction that it had
earlier replaced, and single steps through the program being profiled until
some other instruction is executed. The instruction that causes a fault
then replaces the instruction earlier put back, and the execution monitor
restarts the program being profiled.

The technique of modifying the process space of the process being
profiled is called patching. It is a very powerful technique, and is used
by dynamic debuggers to enable a programmer to watch what happens as
a program is being executed. Depending on the amount of information
in the symbol table of the object code of the program that is to be
profiled, the profiler can print various types and amounts of information.
For example, if line numbers were not present and only functions were,
the source code could not be profiled but functions could be.

Two questions about this patching procedure immediately come to
mind. When the illegal instruction and the instruction it replaced are
exchanged, and the traced program is single stepped, the instruction might
be re-executed. If this happened, the line count would be incorrect. To
avoid this error, the execution monitor must check the program counter
after the single stepping. If the replaced instruction was re-executed, the
monitor increments the counter for that instruction and repeats the
procedure. When the program counter shows that some other instruction has
been executed, the illegal instruction is restored.

The second question is related. Implicit in this method is the
assumption that the instruction causing the fault does not change the
state of the traced process, and in particular the condition codes.
Usually, this is no problem since illegal instructions cause faults not
reflected in the condition codes; if there is no such instruction, however,
matters become far more complicated. If it is possible to write into text
space, the execution monitor should substitute three instructions rather
than one:

    n         copy condition code register to location n + k2
    n + k1    execute illegal instruction
      .
      .
    n + k2    store the former condition codes here

Then, before allowing the program to continue, the execution monitor
would have to restore the condition codes in location n + k2 and then
replace the contents of location n with what was originally there, and
copy these instructions over the next ones. This process would have to
continue until the instruction at location n + k2 is passed, at which
point everything can be restored as it was before the instruction at n
was executed.

Once the program has finished execution, the execution monitor must
print the results. There are two ways to do this. The traditional
method of other profilers running under UNIX has been to dump the
results in an intermediate file (called mon.out or something similar) and
provide another program to print the data there in an intelligible format.
The second is to add the code to print the results to the program being
profiled. The first approach provides more flexibility, because users can
examine the raw data directly; no doubt this is why UNIX profilers tend
to use it. However, UNIX profilers work with a fairly small amount of
data (namely, counts and timings of function calls) rather than with large
amounts of data such as counts for each line. Moreover, for an
execution monitor, adding code to make an intelligible printout adds nothing
to the program being traced, since this code resides in the monitor itself.
So the situation is not so clear-cut here.

An Implementation of an Execution Monitor

The execution monitor described above is being implemented in two 
steps, the first of which has been completed and the second of which is 
in progress. The first version, which we shall discuss now, counts the 
number of times each source line is executed; the second version allows 
functions to be counted as well. The basic structure of both versions is 
the same; we shall highlight the differences when we discuss the second 
version. The first version runs on both VAX† and MC 68000 versions of
4.2 BSD. The second version is being implemented on a VAX running
4.3 BSD.

The first step is to locate the beginnings of units to be counted 
within the traced program. This is done by looking at the symbol table. 
When a special debugging option is given, the 4.2 BSD C compiler 
creates symbol table entries for both source file names and line numbers, 
and with each line number provides the address of the first machine 
instruction in that line. One complication is that several line number
entries may have the same address, for example if a multiline comment is
present. These are loaded into an array of structures of the form

    struct {
        union {
            unsigned t_val;     /* value in symbol table */
            ADDRESS t_tadd;     /* same, treated as an address */
        } t_lpos;               /* where the line occurs */
        WORD t_word;            /* the word that's there */
        WORD t_ill;             /* the word with illegal inst. */
        int t_lno;              /* line number */
        char *t_fnm;            /* pointer to file name */
        int t_count;            /* count from execution monitor */
    };


† VAX is a trademark of Digital Equipment Corporation.



The types ADDRESS and WORD are defined to be the types of an
address and a word on the current machine; for example, on a VAX,
these are

    typedef unsigned int WORD;    /* what a machine word is */
    typedef WORD *ADDRESS;        /* what a machine address is */

The field t_word will hold the word at that location, and the field t_ill
will hold the same word but with the instruction being replaced by an
illegal instruction. All lines are found in one pass over the symbol table.

The next step is to replace the instructions at the beginning of each 
line with the illegal opcode. In this implementation, we use the opcode 
LDPCTX (“LoaD Process ConTeXt” 7 ), which is a privileged operation 
(and when executed by a user’s program will cause a fault) but which 
does not alter the condition codes after the fault. First, the process to 
be profiled is started after marking that it is to be traced; on the VAX, 
this causes a fault after the first instruction of that process is executed. 
At this time, words are copied from the child process’ memory into the 
array of structures described above, and replaced with words modified 
with the illegal instruction at the address indicated by the line number.
(Use of words rather than bytes is necessary, even on a byte-addressed
machine like the VAX, because the ptrace call reads and writes only
words.)

Now, the profiled process is ready to run. It is signaled to continue,
and the execution monitor waits for a fault or termination. If the child
terminates, the monitor analyzes the results. If it faults, the execution
monitor determines what signal caused the fault and where the program
faulted. If the fault was not an illegal instruction, or the address is not
that of a line, the execution monitor will attempt to force the child
process to continue as though it had received that fault. (This usually
results in that process terminating, possibly with a core dump.) Otherwise,
the execution monitor adds 1 to the t_count fields of all lines with that
address in t_lpos. It copies the t_word field of the appropriate entry in
the array into the traced process’ text space, and then single steps,
checking each step until the instruction has been passed. The appropriate
t_ill field is then copied back into the profiled program’s instruction
space. Now, the new program counter value must be compared to the
addresses of the line numbers, lest two lines occupy less than one machine
word. If this is true, the entire procedure is repeated using the new
instruction and line number. If not, program execution continues.

Printing in this version is done by the execution monitor; the user 
can request line counts, a full histogram, or a scaled histogram. The 
basic scheme is the same for all formats — simply traverse the array of 
line numbers and print the counts. In all cases, the usual format is to
print the counts followed by the source file lines. Here is a sample of
output from this program; the program simply generates an array of 1000
numbers and sorts them using a Shell sort 9:

    CTRACE Version 1.3 (July 25, 1983)

    FILE  LINE  COUNT
    x.c      1      0  #define MAX 1000
    x.c      2      0
    x.c      3      0  main()
    x.c      4      1  {
    x.c      5      1      register int i;
    x.c      6      1      int list[MAX];
    x.c      7      1      long random();
    x.c      8      1
    x.c      9      1      srandom(getpid());
    x.c     10      1      for(i = 0; i < MAX; i++)
    x.c     11   1000          list[i] = random();
    x.c     12      1
    x.c     13      1      shell(list, MAX);
    x.c     14      1  }
    x.c     15      0
    x.c     16      0  shell(v, n)
    x.c     17      0  int v[], n;
    x.c     18      1  {
    x.c     19      1      register int i, j, gap, temp;
    x.c     20      1
    x.c     21      1      for(gap = n/2; gap > 0; gap /= 2)
    x.c     22      9          for(i = gap; i < n; i++)
    x.c     23   8006              for(j=i-gap; j>=0 && v[j]>v[
    x.c     24   7319                  temp = v[j];
    x.c     25   7319                  v[j] = v[j+gap];
    x.c     26   7319                  v[j+gap] = temp;
    x.c     27   7319              }
    x.c     28      1
    x.c     29      1  }

Note that the counts must be interpreted properly. For example, look at
the “for” loop in lines 10-11. Even though the count is 1, the test in
the “for” statement is executed 1000 times; the problem is that the 4.2
BSD C compiler puts the symbol for the line number at the machine
instruction generated for the initialization, and the next line number is for
that of the loop. Unfortunately, fixing this would require the compiler to
be changed.

The Next Version 

This version works on principles similar to the first version, but will 
permit functions in the symbol table to be profiled. This is of more use
than the profiling of lines, since one need not have compiled the program 
with debugging information, and need not have the source available. 
However, it requires information about how different machines handle 
function calls. Some, such as the MC 68000, begin at the address stored 
in the symbol table. In this case, the illegal instruction can be placed at 
the address of the function. Others, such as the VAX, begin execution at 
the word after the address of the function (the word at the address is 
used to indicate what registers should be saved, among other things). In 
these cases, the illegal instruction must be placed at the first word exe- 
cuted upon entry into the function. 

The second difference is that the user will be able to specify what 
lines, source files, and functions are to be profiled. One of the main 
problems with the first version is that a signal trap occurred on every 
line. In the second version, this will only be true of the specific parts
that the user wants to trace.

Comparison of Profiling Methods

The discussion in the introduction pointed out some problems with 
various methods of profiling: having the compiler generate counters and
timers; preprocessing programs and inserting profiling code; postprocessing
assembly language output from the compiler and inserting profiling code; 
and using an execution monitor. The question of which method is best 
cannot be answered simply; to a large degree, it depends on what tools 
are available and what information is desired. 

First, if the user wants to generate counts for each source line, using
compiler-generated code is probably not an option, since most UNIX
compilers do not provide such statistics. Preprocessing programs solves the
problems posed by condition codes, since the compiler takes care of them;
but such programs require at minimum a parser (to ensure that adding the
profiling statements does not produce a syntax error). Postprocessing has
the problem with condition codes, and requires a knowledge of the
machine’s assembly language instructions as well as the code generated by
the assembler: for example, the type of branch instruction used on many
machines (such as the VAX) depends on how far a branch may occur.
Patching requires only that one be able to extract the program counter
from the address space of the process being traced. So from the
programming point of view, patching is the easiest method to implement.

From the user’s point of view, patching is the most flexible method
but the slowest. Using patching, one can profile one section of the
program, and then profile a completely different section without having to
recompile the program. None of the other three methods of profiling
allows this; all would require recompilation. Only patching allows any
profiling without compiling special code; all other methods add code before
assembly, so to profile using these methods, previously compiled
programs must be recompiled. While patching will only allow you to
profile those units saved in the symbol table, in most cases this includes
functions, which are very often the main units of interest.

Finally, should the profiled program terminate abnormally (say, with 
a bus error), other UNIX profiling packages will not allow the user to 
obtain a profile because the intermediate file is either not written out or 
corrupt. (Gprof generated an intermediate file, but core-dumped; prof did
not generate any intermediate file.) Correcting this problem would not 
always be possible, since some events causing abnormal termination cannot 
be trapped (for example, the signal SIGKILL). An execution monitor, 
however, can easily determine why the profiled process stopped, and since 
the statistics gathered are in the process space of the monitor rather than 
the profiled program, the requisite statistics can be generated. 

A Wish List 

Several aspects of the kernel limit what an execution monitor can do,
and we would like to see them changed. The major bottleneck is the system
call ptrace, which is the mechanism used to control the execution of the
profiled program. Its main problem is that only children may be controlled,
and only children started up after the execution monitor has begun can
be profiled. This poses several problems. First, only the parent part of
a process that forks can be monitored; children are on their own.
Second, it is not possible to monitor a program that is already running
(such as the kernel). Third, every signal will cause a trap to the
execution monitor; it should be possible to instruct the process being
profiled to treat certain signals normally rather than having the profiled
program return control to the monitor. Finally, the ptrace mechanism is
itself cumbersome and slow, and should be replaced with something more
elegant and faster. Not being able to obtain timing information from a
child process which has not terminated is also a problem. Were this not
so, the execution monitor would be able to provide timing statistics as
well as counts, by obtaining timings at each unit and subtracting. (In
some cases, extra illegal instructions would need to be inserted; for
example, at the end of functions as well as at the beginning.) A third useful
feature would be automatically preserving condition codes when a fault
occurs, and restoring them when execution resumes. This problem can
usually be circumvented by choosing the instructions to place in the
profiled process’ text space appropriately, but it would be better not to have
to worry about this at all.

Many of these features would be useful in contexts other than
profiling; for example, in debugging 10. Some manufacturers of multiprocessing
machines have already made some of these changes.†


† For example, the ptrace system call for Dynix 2.0, by Sequent Computer Systems, Inc., will al-

Conclusion

Patching is a very powerful method of profiling. It allows any exe- 
cutable program with a symbol table to be profiled, and the more func- 
tions and source line numbers in the symbol table, the more that can be 
profiled. It does not rely on the existence of either assembly language 
source files or higher level language source files; indeed, even if the source 
is unavailable, the program can be profiled! Its drawback, that it causes 
the profiled program to run very slowly, can be ameliorated by judiciously 
choosing the units, and sections of code, to be profiled. 